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position  and  orientation,  the  alignment  method  avoids  structuring 
the  recognition  process  as  an  exponential  search.  To  demonstrate 
the  method,  we  present  some  examples  of  recognizing  flat,  rigid 
objects  with  arbitrary  three-dimensional  position,  orientation, 
and  scale,  from  a  single  two-dimensional  image.  The  recognition 
system  chooses  features  for  alignment  using  a  scale-space 
segmentation  of  edge  contours.  Segments  are  described  in  terms  of 
both  their  shape  and  the  structure  of  the  scale-space  hierarchy 
at  the  next  finer  level,  producing  distinctive  features  for  use 
in  finding  possible  alignments.  Finally,  the  method  is  extended 
to  the  domain  of  non-flat  objects  as  well. 
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1  Introduction 


Object  recognition  involves  identifying  a  correspondence  between  part  of  an  image 
and  a  particular  view  of  a  known  object.  This  requires  matching  the  image  against 
stored  object  models  to  determine  if  any  of  the  models  could  produce  a  portion  of 
the  image.  For  the  general  recognition  problem,  the  number  of  possible  models  is 
very  large.  Most  existing  recognition  systems,  however,  consider  only  one  or  a  small 
number  of  object  models  (see  [4]  for  a  recent  review  of  recognition  systems). 

Even  for  a  single  model,  a  given  object  can  produce  substantially  different 
images  depending  on  its  position  and  angle  with  respect  to  the  viewer.  First,  from 
a  particular  view,  part  of  an  object  may  be  occluded.  Second,  an  object  may  be 
distorted  by  projection  into  the  image  plane  (e.g.,  foreshortening).  Finally,  an  object 
may  itself  undergo  transformations  such  as  having  parts  that  move  independently, 
or  being  stretched  or  bent.  Most  recognition  systems  assume  that  objects  are  rigid, 
and  do  not  undergo  any  transformation  [16]  [5]  [11]  [3].  Some  systems  allow  for 
perspective  projection  [21],  and  some  have  parameterized  models  with  parts  that 
can  articulate  [8]. 

The  presence  of  more  than  one  object  in  an  image  also  complicates  the  recogni¬ 
tion  problem.  First,  objects  may  occlude  one  another.  Second,  different  objects  in 
the  image  must  somehow  be  individuated.  In  the  case  of  touching  and  overlapping 
objects  this  generally  cannot  be  done  prior  to  recognition,  but  rather  must  be  part 
of  the  recognition  process  itself. 

Considerable  attention  has  been  paid  to  the  problem  of  recognizing  planar 
objects  with  two-dimensional  positional  uncertainty  (we  will  refer  to  this  as  the  2^ 
recognition  problem).  There  are  several  2D  recognition  systems  that  can  find  a 
given  object  in  a  grey-scale  image,  even  when  the  oc.^ect  is  partially  occluded  [16] 

[5]. 

Recognizing  objects  in  three-space  has  generally  been  approached  using  a  depth 
map,  which  specifies  the  distance  from  the  sensor  at  each  pixel  (the  3D  from  3D 
recognition  problem).  A  depth  map  can  be  derived  from  a  laser  scanner,  stereo 
matcher,  or  shape  from  motion,  shading,  contours,  etc.  Relatively  successful  sys¬ 
tems  have  been  developed  for  the  3D  from  3D  recognition  task,  using  laser  scanners 
to  derive  a  depth  map  [17]  [6]. 

Lowe’s  recent  work  [21]  addresses  the  problem  of  recognizing  objects  with  three- 
dimensional  positional  uncertainty  given  a  single  two-dimensional  view  (the  3D  from 
2D  recognition  problem).  In  3D  from  2D  recognition,  the  sensory  data  only  partially 
specifies  the  position  of  the  object.  Thus  it  appears  to  be  more  difficult  than  3D  from 
3D  recognition.  People,  however,  seem  to  be  good  at  this  task,  making  it  unclear 
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whether  three-dimensional  sensory  input  is  actually  necessary  for  recognition. 

The  Task 

In  this  paper  we  consider  the  problem  of  matching  a  two-dimensional  view  of  a 
rigid  object  against  a  potential  model.  The  viewed  object  can  have  arbitrary  three- 
dimensional  position,  orientation,  and  scale,  and  may  be  touching  or  occluded  by 
other  objects.  First  we  consider  the  domain  of  flat  rigid  objects  such  as  the  wid¬ 
get  shown  in  Figure  1.  While  the  viewed  object  is  flat,  the  problem  is  not  two- 
dimensional  because  a  flat  object  positioned  in  three-space  can  undergo  distortion 
such  as  foreshortening  when  projected  into  the  image  plane.  This  task,  like  the 
general  recognition  task,  suffers  from  problems  of  occlusion  and  of  individuating 
multiple  objects  in  an  image.  There  is  also  a  limited  kind  of  shape  distortion 
caused  by  projecting  a  rigid  object  into  the  image.  We  then  consider  extending 
the  method  to  the  domain  of  rigid  objects  in  general,  such  as  the  station  wagon  in 
Figure  2. 


Figure  1.  A  widget  used  in  recognition. 

The  current  task  cannot  be  handled  by  recognition  systems  that  assume  rigid  objects 
with  no  distortion  [16]  [5]  [11]  [3],  or  by  systems  that  only  allow  parameterized 
variation  of  rigid  models  [8].  The  task  is  similar  to  that  of  Lowe,  who  addresses  the 
problem  of  three-dimensional  recognition  from  a  single  two-dimensional  view  [21]. 
The  task  considered  by  Lowe  is  more  restricted,  however,  because  it  is  assumed  that 
objects  are  polyhedral,  and  are  viewed  such  that  parallel  surfaces  appear  more  or 
less  parallel. 

In  order  to  solve  this  task,  we  present  a  new  approach  to  recognition  in  which 
the  process  of  matching  a  model  to  an  image  is  divided  into  two  stages.  In  the  first 
stage,  the  model  is  aligned  with  the  image  using  a  small  number  of  model  and  image 
;  features.  In  the  second  stage,  the  alignment  is  used  to  transform  the  model  into 

image  coordinates.  Once  the  position  and  orientation  have  been  determined,  the 
model  can  be  compared  directly  with  the  image.  The  key  observation  underlying 
i  the  alignment  operation  is  that  the  position  and  orientation  of  a  rigid  object  can  be 

i 
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Figure  2.  A  station  wagon  used  in  recognition. 


determined  from  a  small  number  of  position  and  orientation  measures.  In  contrast, 
many  recognition  systems  search  for  the  largest  set  of  model  and  image  feature 
pairs  that  are  consistent  with  a  single  position  and  orientation  of  a  rigid  object. 
The  number  of  such  sets  is  exponential,  requiring  the  use  of  various  techniques  to 
limit  the  search. 


Aligning  a  Model  With  an  Image 

For  2D  recognition,  only  two  pairs  of  corresponding  model  and  image  points  are 
needed  to  align  a  model  with  an  image.  Consider  two  pairs,  ( am,ai )  and  (6m,  6,), 
such  that  model  point  am  corresponds  to  image  point  a,  and  model  point  bm  cor¬ 
responds  to  image  point  6j.  Figure  3a  shows  the  edge  contours  of  two  widgets, 
labeled  with  these  four  points.  The  two-dimensional  alignment  of  the  contours  has 
three  steps.  First  the  model  is  translated  such  that  am  is  coincident  with  a,  as 
shown  in  Figure  3b.  Then  it  is  rotated  about  the  new  am  such  that  the  edge  ambm 
is  coincident  with  the  edge  a,6j  as  shown  in  part  (c).  Finally  the  scale  factor  is 
computed  to  make  bm  coincident  with  am,  as  shown  in  part  (d).  These  two  trans¬ 
lations,  one  rotation,  and  a  scale  factor  make  each  rr  --'eluded  point  of  the  model 
coincident  with  its  corresponding  image  point,  as  long  as  the  initial  correspondence 
of  (am,ai)  and  (bm,bi)  is  correct. 

For  3D  from  2D  recognition,  the  alignment  method  is  similar,  requiring  three 
pairs  of  model  and  image  points  to  perform  a  three-dimensional  transformation 
and  scaling  of  the  model.  Section  6  presents  the  3D  from  2D  alignment  method 
in  detail,  showing  how  to  use  three  pairs  of  model  and  image  points  to  position 
and  orient  a  model  in  three-space  given  a  single  two-dimensional  view,  assuming 
orthographic  projection.  A  transformation  from  the  model  to  the  image  consists  of 
two-dimensional  translation,  three-dimensional  rotation,  and  a  linear  scale  factor 
that  is  proportional  to  the  viewing  distance.  Under  normal  viewing  conditions, 
orthographic  projection  plus  scale  is  a  reasonable  approximation  to  perspective 
viewing.  Under  high  perspective  distortion  -  when  the  object  occupies  much  of  the 
field  of  view  -  the  approximation  is  poor.  It  appeals,  however,  that  under  such 
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Figure  3.  Two  dimensional  translation,  rotation  and  scaling  of  one  object  to 
match  another. 

conditions  recognition  is  also  difficult  for  humans. 

The  alignment  method  requires  identifying  potentially  corresponding  model 
and  image  points  such  as  (am,  a*)  in  Figure  3.  These  pairs  are  then  used  to  determine 
possible  alignments  of  the  model  with  the  image.  Local  orientation  measures  can 
also  be  used  to  solve  for  possible  alignments.  The  problem  of  finding  points  and 
orientations  for  alignment  is  addressed  in  Section  3  and  Section  4.  In  Section  5, 
a  system  for  recognizing  flat  objects  with  three-dimensional  positional  freedom  is 
described.  Some  recognition  examples  are  also  presented  in  that  section.  The  details 
of  the  3D  from  2D  alignment  computation  are  given  in  Section  6,  and  the  method 
is  extended  to  handle  non-planar  objects  in  Section  7. 


2  Matching  Models  and  Images:  Previous  Approaches 


Object  recognition  is  generally  viewed  as  a  two  stage  process.  First,  possible  object 
models  are  hypothesized.  Second,  each  hypothesis  is  tested  to  determine  if  the 
model  can  be  positioned  and  oriented  such  that  it  matches  the  image  data.  Most 
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recof  nition  systems  address  the  problem  of  matching  a  model  to  an  image,  assuming 
that  an  appropriate  model  has  already  been  hypothesized  [8]  [5]  [11].  Some  recent 
work  has  addressed  the  problem  of  hypothesizing  relevant  models,  but  only  for 
two-dimensional  recognition,  and  relatively  small  object  libraries  [20]. 

Assuming  that  one  or  a  small  number  of  possible  objects  have  been  hypothe¬ 
sized,  there  are  several  systems  for  matching  a  model  of  a  rigid  object  to  an  image 
(cf.  [4]).  These  systems  all  exploit  rigidity  by  noting  that  for  a  given  position  and 
orientation  of  a  rigid  object,  there  must  be  a  single  transformation  that  maps  each 
model  feature  onto  its  corresponding  image  feature.  This  transformation  consists 
of  a  three-dimensional  rotation  and  translation  in  3D  from  3D  recognition,  and  a 
solution  to  the  perspective  viewing  equation  in  3D  from  2D  recognition  [21]. 

In  this  framework,  recognition  is  generally  structured  as  a  search  for  the  largest 
pairing  of  model  and  image  features  for  which  there  is  a  single  transformation 
mapping  each  model  feature  to  its  corresponding  image  feature  [11]  [5]  [16]  [8]  [21]. 
For  i  image  features  and  m  model  features  there  are  at  most  p  =  i  x  m  pairs  of 
features.  Because  of  occluded  image  points,  and  image  points  that  do  not  correspond 
to  the  model,  any  subset  of  these  p  pairs  could  be  the  largest  set  of  matching  model 
and  image  points,  and  thus  the  number  of  possible  matches  is  exponential  in  the 
size  of  p. 

Two  methods  are  used  to  limit  the  number  of  possible  matchings  of  model  and 
image  features.  The  first  is  to  use  the  identity  of  features  to  specify  which  model 
features  can  match  which  image  features,  thereby  reducing  the  number  of  pairs,  p 
[5]  [21].  Even  in  the  case  that  each  model  feature  has  only  a  single  corresponding 
image  feature,  however,  there  may  be  multiple  matches  because  the  image  features 
may  actually  correspond  to  some  other  object. 

A  problem  with  using  the  identity  of  features  in  recognition  is  that  there  is  a 
tradeoff  between  the  uniqueness  of  a  feature  and  the  robustness  with  which  it  can 
be  recognized.  Because  systems  that  rely  heavily  on  the  identity  of  features  must 
use  highly  distinctive  features,  they  tend  to  be  sensitive  to  noise  and  occlusion  in 
the  image. 

The  second  method  for  limiting  the  search  is  to  use  relations  between  features 
to  eliminate  pairs  of  model  and  image  features  that  are  inconsistent  [16]  [5]  [11].  For 
instance,  in  order  for  two  pairs  of  model  and  image  features  (am,a,)  and  ( bm,b, ) 
to  be  part  of  a  consistent  match,  the  distance  between  the  image  features  a,  and  6, 
must  be  the  same  as  the  distance  between  the  model  features  am  and  bm,  within 
some  error  bound.  Similarly,  the  angle  between  orientation  measures  for  any  pair 
of  image  features  must  match  the  angle  between  the  corresponding  pair  of  model 
features. 

A  problem  with  using  relations  between  features  in  recognition  is  that  the 
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relations  must  be  measurable  in  the  image.  Since  relations  such  as  distance  and 
angle  are  not  invariant  under  projection,  three-dimensional  recognition  systems  that 
use  these  relations  require  three-dimensional  data.  Relations  that  are  invariant 
under  projection  tend  to  be  much  weaker  than  distance  and  angle  relations. 


Using  the  Identity  of  Features 

This  section  considers  several  recognition  systems  that  rely  heavily  on  the  identity 
of  individual  features.  The  features  used  in  recognition  are  generally  local  shape 
properties,  or  parts,  of  an  object.  Global  properties  such  as  moments  of  inertia  (cf. 
[4])  are  also  used  as  features  for  matching.  Global  properties  of  objects  are  highly 
sensitive  to  occlusion,  however,  making  them  inappropriate  for  tasks  such  as  the 
ones  considered  here. 

The  LFF  [5]  and  3DP0  [6]  systems  look  for  a  variety  of  pre-determined  features 
such  as  comers  and  holes,  which  are  grouped  together  based  on  proximity.  A  “focus 
feature”  in  each  group  is  chosen  for  use  in  matching.  This  feature  is  described  in 
terms  of  its  type  (e.g.,  comer,  hole),  and  the  type,  distance,  and  angle  of  the  other 
features  in  the  cluster.  By  clustering  local  features,  the  LFF  system  produces  highly 
distinctive  labels  for  the  focus  features.  Using  proximity  to  cluster  image  features, 
however,  may  yield  clusters  composed  of  features  from  multiple  objects.  Such  image 
clusters  will  generally  not  match  the  correct  model  cluster.  This  makes  the  system 
sensitive  to  the  position  and  orientation  of  neighboring  and  occluding  objects. 

The  SCERPO  3D  from  2D  recognition  system  [21]  [22]  uses  proximity  and 
parallelism  to  group  edge  fragments  together  into  simple  features.  As  in  the  LFF 
system,  the  use  of  proximity  can  be  problematic  when  there  are  neighboring  or 
occluding  objects.  Since  parallelism  is  not  invariant  under  viewing  position,  “al¬ 
most  parallel”  segments  are  also  grouped  together.  This  too,  however,  may  fail 
under  some  viewing  conditions.  The  strong  reliance  on  parallelism  restricts  this 
classification  scheme  to  polyhedral  objects,  where  parallel  surfaces  appear  almost 
parallel. 

Various  shape  descriptions  have  been  proposed  that  use  local  features  to  de¬ 
scribe  objects  (SLS  [7],  LRS  [14],  curvature  primal  sketch  [2],  codons  [19]).  Some 
of  these  representations  have  been  used  in  recognition  systems  (e.g.,  SLS  [10]),  and 
others  have  been  discussed  as  possible  representations  for  recognition  (e.g.,  codons 
[19]).  These  shape  descriptions  are  intended  to  produce  highly  unique  features  for 
use  in  matching  models  against  images.  Thus  an  object  model  is  matched  to  an 
image  using  primarily  the  identity  of  individual  image  and  model  features,  ruther 
than  structural  relations  between  features. 

The  curvature  primal  sketch  [2]  is  a  shape  description  composed  of  a  hierarchy 
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of  approximations  to  edge  contours  at  various  scales.  At  a  given  scale,  the  curvature 
primal  sketch  consists  of  a  cubic  spline  approximation  to  a  two-dimensional  edge 
contour.  The  endpoints  of  the  splines  occur  at  local  maxima  in  the  smoothed  cur¬ 
vature  of  the  edge  contour.  There  are  two  major  problems  with  using  the  curvature 
primal  sketch  for  recognition.  First,  splines  connecting  maximal  curvature  points 
are  only  a  good  approximation  at  relatively  fine  scales,  when  the  local  maxima  in 
curvature  do  not  occur  very  far  apart.  Second,  maximum  curvature  points  are  un¬ 
stable  under  three-dimensional  rotation  and  projection.  This  issue  is  considered  in 
Section  4  on  finding  points  for  use  in  alignment. 

Symmetry-based  shape  descriptions  [7]  [14],  including  generalized  cones  or 
cylinders  [23]  [8],  discard  a  lot  of  shape  information.  Symmetry  is  a  shape  attribute 
that  can  only  be  computed  at  relatively  large  scales,  making  it  difficult  to  encode 
detailed  shape  information  in  such  a  representation.  Because  symmetry  is  a  rela¬ 
tively  global  shape  property  it  is  also  quite  sensitive  to  occlusion.  The  information 
preserved  by  symmetry  representations,  such  as  lengths  and  orientations  of  axes,  is 
not  invariant  under  projection,  making  such  representations  inappropriate  for  3D 
from  2D  recognition  tasks.  Symmetry  axes  can,  however,  be  used  as  orientation 
measures  for  alignment,  as  described  in  Section  4. 

Using  Relations  Between  Features 

At  the  other  extreme  are  systems  that  do  not  use  feature  identity  at  all,  but  rather 
rely  exclusively  on  structural  relations  between  image  and  model  points.  Grimson 
and  Lozano-Perez  [16]  [17]  structure  recognition  as  a  search  through  a  tree  of  all 
pairs  of  model  and  image  points.  A  given  level  of  the  tree  pairs  a  particular  image 
point  with  each  model  point.  Distance  and  angle  relations  between  pairs  of  points 
are  used  to  prune  the  tree.  If  at  any  point  along  b  Dath,  a  node  specifies  a  pair 
of  points  that  are  inconsistent  with  some  previous  node  on  that  path,  then  the 
remainder  of  the  path  is  not  explored  because  it  cannot  lead  to  a  consistent  set  of 
pairs.  In  order  to  handle  image  points  from  other  objects,  there  is  a  special  model 
point  called  the  null  face  that  will  match  any  image  point.  The  null  face  is  expanded 
last  in  searching  the  tree,  and  the  longest  path  (not  counting  null  face  branches)  is 
always  expanded  first. 

It  has  been  demonstrated  that  this  tree  search  converges  rapidly  for  both  2D 
from  2D  [16]  and  3D  from  3D  [17]  recognition  tasks.  T  \e  success  of  the  algorithm 
is  due  to  the  power  of  the  distance  and  angle  constraints  to  prune  the  exponential 
tree  of  possible  matches.  The  fact  that  any  model  and  image  points  are  allowed 
to  match  makes  the  system  robust  in  the  face  of  partial  information,  such  as  when 
there  is  substantial  occlusion.  The  strong  reliance  on  distance  and  angle  relations, 
however,  makes  the  method  inapplicable  to  the  problem  of  3D  from  2D  recognition. 
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In  addition  to  empirical  demonstrations  of  the  power  of  distance  and  angle 
relations  among  features,  it  has  been  shown  that  given  a  noise  bound  for  the  sensory 
data,  pairwise  distance  relations  can  be  used  to  develop  an  0(n2)  time  algorithm  for 
recognizing  an  isolated  object  with  n  features,  which  can  undergo  two-dimensional 
rotation  and  translation  [3]. 

The  LFF  and  3DP0  systems  [5]  [6]  use  distance  and  angle  relations  between 
features  as  well  as  the  identity  of  individual  features.  Model  and  image  features 
of  the  same  type  are  paired  together.  The  space  of  possible  subsets  of  these  pairs 
is  then  searched  using  a  graph  matching  algorithm.  Pairwise  distance  and  angle 
relations  are  used  to  form  a  graph  structure.  Each  pair  of  model  and  image  features 
is  represented  by  a  node,  and  each  consistent  pair  of  nodes  is  connected  by  an  arc. 
A  maximum  clique  of  this  graph  constitutes  a  largest  pairwise  consistent  match  of 
model  and  image  features. 

This  method  has  been  shown  to  find  a  maximally  consistent  match  for  a  variety 
of  images  in  a  two-dimensional  recognition  task.  Much  of  the  success  of  the  algo¬ 
rithm  is  attributable  to  the  use  of  local  feature  clusters  to  restrict  the  number  of 
possible  initial  pairs  of  model  and  image  features.  As  discussed  above,  this  reliance 
on  local  context  makes  the  system  sensitive  to  overlapping  objects. 

ACRONYM  [8]  uses  both  the  identity  of  features  and  the  relations  between 
features  in  recognition.  The  features  are  generalized  cones,  which  are  described 
in  terms  of  axes  and  cross  sections.  The  presence  of  a  given  feature  is  used  to 
predict  the  positions  and  orientations  of  other  features.  Therefore,  like  the  other 
systems  discusssed  in  this  section,  given  two-dimensional  data  ACRONYM  can 
perform  only  2D  recognition.  While  ACRONYM’S  geometric  modeling  component 
is  heavily  three-dimensional,  the  system  was  tested  using  aerial  photographs,  which 
are  two-dimensional  in  nature. 

The  SCERPO  system  [21]  [22]  is  the  only  one  that  addresses  the  problem  of 
three-dimensional  recognition  from  a  single  two-dimensional  view.  The  problem  of 
finding  a  best  match  is  structured  as  a  tree  search  (similar  to  [16]),  however  the  check 
for  a  consistent  match  is  different  from  previous  systems.  A  given  set  of  model  and 
image  feature  pairs  are  consistent  if  there  is  a  solution  to  the  perspective  viewing 
equation  that  maps  each  model  feature  onto  its  corresponding  image  feature.  There 
are  no  simple  pairwise  checks  such  as  distance  and  angle  relations  that  can  be  used 
to  test  the  consistency  of  a  set  of  pairs  incrementally,  given  an  added  pair.  Instead, 
the  perspective  viewing  equation  is  solved  for  each  added  pair  of  model  and  image 
points.  This  is  done  using  Newton- Raphson  iteration,  which  allows  a  solution  to  be 
modified  to  account  for  a  new  pair. 

None  of  the  recognition  systems  discussed  in  this  section  ran  effectively  han¬ 
dle  the  task  of  recognizing  objects  that  have  arbitrary  three-d  mensional  pos  fion, 
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orientation,  and  scale,  from  a  single  two-dimensional  view.  Each  system  solves  a 
more  restricted  recognition  task,  in  order  to  limit  the  space  of  possible  matchings 
of  model  and  image  features  (e.g.,  requiring  distance  and  angle  relations  to  be  mea¬ 
surable,  or  parallel  surfaces  to  appear  parallel).  The  next  section  describes  how  to 
recognize  an  object  by  first  aligning  it  with  an  image.  This  use  of  alignment  avoids 
having  to  structure  the  recognition  process  as  an  exponential  search. 


3  The  Alignment  Method  of  Recognition 


We  have  seen  above  that  recognition  can  be  viewed  as  a  search  through  the  space 
of  all  possible  positions  and  orientations  of  all  possible  objects.  The  idea  of  the 
alignment  approach  is  to  separate  this  search  into  two  stages.  In  the  first  stage,  the 
position,  orientation,  and  scale  of  an  object  are  found  using  a  minimal  amount  of 
information,  such  as  three  pairs  of  model  and  image  points.  In  the  second  stage,  the 
alignment  is  used  to  map  the  object  model  into  image  coordinates  for  comparison 
with  the  image. 

There  are  two  major  advantages  of  this  approach.  First,  by  using  a  small  fixed 
number  of  model  and  image  features,  the  method  avoids  structuring  recognition  as 
a  search  through  an  exponential  space.  Second,  a  given  alignment  can  be  used  to 
match  multiple  models  against  the  image.  If  a  group  of  objects  are  stored  such  that 
they  are  aligned  with  one  another,  then  a  single  alignment  computation  will  map 
the  entire  group  of  objects  into  the  image. 

The  key  observation  behind  the  approach  is  thsf  the  alignment  can  be  per¬ 
formed  with  a  small  amount  of  information.  For  example,  three  image  points  and 
three  corresponding  model  points  are  sufficient  to  determine  the  position,  orien¬ 
tation  and  scale  of  a  rigid  object  in  three-space.  Similarly,  two  points  and  an 
orientation  measure  can  also  be  used  to  solve  for  the  three-dimensional  alignment. 

Consider  an  object,  0 ,  with  three-dimensional  positional  freedom,  and  a  two- 
dimensional  image,  I,  which  contains  a  view  of  0  (perhaps  along  with  other  objects). 
We  are  interested  in  losing  the  alignment  computation  to  find  0  in  the  image. 
Assume  that  a  feature  detector  returns  a  set  of  potentially  matching  model  and 
image  feature  pairs,  P  (one  such  detector  is  described  in  Section  5).  Since  three  pairs 
of  model  and  image  features  specify  a  potential  alignment  of  a  model  with  an  image, 
any  triplet  in  P  may  specify  the  position  and  orientation  of  the  object.  In  general, 
some  small  number  of  triplets  will  specify  the  correct  position  and  orientation,  and 
the  rest  will  be  due  to  incorrect  matchings  of  model  and  image  points.  Thus  the 
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recognition  problem  is  to  determine  which  alignment  in  P  defines  the  transformation 
that  best  maps  the  model  into  the  image. 

Given  a  set  of  pairs  of  model  and  image  features,  P,  we  solve  for  the  alignment 
specified  by  each  triplet  in  P.  For  some  triplets,  there  will  be  no  way  to  position 
and  orient  the  three  model  points  such  that  they  project  onto  their  corresponding 
image  points.  Such  triplets  do  not  specify  a  possible  alignment  of  the  model  and 
the  image.  The  remaining  triplets  each  specify  a  transformation  mapping  model 
points  to  image  points.  An  alignment  is  scored  by  using  the  transformation  to  map 
the  model  edges  into  the  image,  and  comparing  the  transformed  model  edges  with 
the  image  edges.  The  best  alignment  is  the  one  that  maps  the  most  model  edges 
onto  image  edges. 

For  m  model  features  and  i  image  features,  the  number  of  pairs  of  model  and 
image  features,  p,  is  at  most  i  x  m.  With  a  good  labeling  scheme,  the  number 
of  pairs,  p,  will  be  much  smaller,  approaching  m  when  each  model  point  has  one 
corresponding  image  point.  Given  p  pairs  of  features,  there  are  (£),  or  an  upper 
bound  of  0(p3),  triplets  of  pairs.  Each  triplet  specifies  a  possible  alignment  of  the 
model  and  the  image.  An  alignment  is  scored  by  mapping  the  model  edges  into  the 
image.  If  the  model  edges  are  of  length  /,  then  the  worst  case  running  time  of  the 
algorithm  is  0(lp3).  Thus  by  structuring  the  recognition  process  as  an  alignment 
stage  followed  by  a  comparison  stage,  it  is  transformed  from  the  exponential  problem 
of  finding  the  largest  consistent  set  of  model  and  image  points,  to  the  polynomial 
problem  of  finding  the  best  triplet  of  model  and  image  points. 

Since  the  number  of  possible  alignments  is  cubic  in  the  number  of  model  and 
image  feature  pairs,  it  is  important  to  label  features  distinctively  in  order  to  limit 
the  number  of  pairs.  If  the  number  of  pairs,  p,  is  small,  then  little  or  no  search 
is  necessary  to  find  the  correct  alignment.  For  instance,  if  p  =  3  there  is  only  one 
possible  alignment,  and  if  p  =  5  there  are  (*),  or  10,  possible  alignments.  The 
problem  of  labeling  features  is  discussed  in  Section  4  and  Section  5. 


Limiting  the  Number  of  Possible  Alignments 

We  have  formulated  the  recognition  problem  as  finding  the  alignment  that  best 
matches  the  model  edges  to  image  edges.  Since  any  triplet  of  model  and  image 
points  could  specify  that  alignment,  each  one  must  be  considered.  Having  solved 
for  a  possible  alignment,  additional  pairs  of  model  and  image  points  or  orientations 
can  be  used  to  check  the  validity  of  the  alignment,  before  projecting  the  model  into 
the  image.  The  fact  that  a  given  alignment  does  not  map  a  model  point  onto  a 
corresponding  image  point  does  not,  however,  mean  that  the  alignment  is  incorrect. 
It  could  also  be  the  case  that  the  additional  pair  of  model  and  image  points  is 


For  certain  kinds  of  model  and  image  pairs,  it  is  fairly  certain  that  the  corre¬ 
spondence  of  the  model  and  image  features  is  correct.  Thus,  a  failure  to  align  these 
additional  measures  indicates  an  incorrect  alignment.  For  instance,  if  the  feature 
detector  returns  not  only  a  location  and  a  label,  but  also  an  orientation  of  each 
feature,  then  it  is  fairly  certain  that  a  given  feature  point  and  orientation  belong  to 
the  same  viewed  object. 

When  an  orientation  measure  is  associated  with  each  model  and  image  point, 
the  alignment  computation  is  modified  so  that  each  of  the  three  model  orientations 
is  mapped  into  the  image,  and  compared  with  the  corresponding  image  orientation. 
If  all  three  match,  then  the  triplet  specifies  a  possible  alignment,  otherwise  it  is 
discarded. 

The  use  of  local  orientation  measures  to  filter  possible  alignments  is  different 
from  the  use  of  orientation  constraints  in  existing  recognition  systems  (e.g.,  [16] 
[11]  [5]).  Here,  each  model  orientation  undergoes  a  three-dimensional  alignment 
transformation  to  map  it  into  the  image,  where  it  is  compared  with  a  corresponding 
image  orientation.  In  contrast,  existing  systems  compare  the  angle  between  a  pair 
of  model  features  with  the  angle  between  a  pair  of  image  features.  This  requires 
that  the  image  measures  specify  three-dimensional  orientation  in  order  to  match 
them  with  a  three-dimensional  model. 

Alignment  and  Identification  are  Separate  Operations 

Finding  a  match  of  a  model  to  an  image  requires  both  aligning  the  object  with 
the  image,  determining  its  position  and  orientation,  and  identifying  the  object, 
determining  that  it  is  actually  present.  However,  the  operations  of  alignment  and 
identification  have  substantially  different  data  requirements. 

Aligning  a  model  with  an  image  requires  only  a  small  number  of  data  points. 
For  instance  three  pairs  of  model  and  image  points  are  sufficient  to  determine  the 
three-dimensional  position  and  orientation  of  an  object,  plus  a  linear  scale  factor.  A 
small  number  of  data  points  are  not  only  sufficient  for  alignment,  it  is  advantageous 
to  have  fewer  points,  because  the  number  of  possible  alignments  of  a  model  and  an 
image  is  cubic  in  the  number  of  model  and  image  point  pairs. 

In  contrast  to  alignment,  identification  requires  a  fa.rly  large  number  of  data 
points.  In  the  worst  case,  identification  can  require  almost  arbitrary  amounts  of 
data,  when  differentiating  between  two  similar  objects. 

The  difference  between  alignment  and  identification  is  most  pronounced  in  3D 
from  2D  recognition.  In  these  tasks,  the  position  and  orientation  should  ideally  be 
known  before  attempting  to  identify  an  object,  as  projection  distorts  an  object's 
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shape.  Because  the  alignment  method  solves  for  position  and  orientation  using  a 
minimal  amount  of  information,  it  is  particularly  well  suited  to  this  problem.  I' he 
differing  requirements  of  alignment  and  identification  have  led  us  to  propose  the 
alignment  method,  where  alignment  precedes  comparison. 

When  alignment  and  identification  are  combined,  moderately  detailed  and 
dense  features  must  be  used  in  order  to  be  able  to  distinguish  objects  from  one 
another.  The  large  number  of  features  makes  the  number  of  possible  positions  and 
orientations  very  large,  because  different  n-tuples  of  points  will  produce  slightly 
different  positions  even  though  they  correspond  to  the  “same”  interpretation  of  the 
object.  This  is  illustrated  by  Grimson  and  Lozano-Perez’s  system  [16],  which  finds 
a  large  number  of  highly  similar  interpretations,  differing  only  slightly  in  position  or 
orientation.  Since  these  interpretations  are  essentially  the  same,  they  are  grouped 
together  into  a  single  interpretation.  In  contrast,  by  using  a  small  number  of  coarse 
features  to  first  align  an  object  with  an  image,  there  will  be  relatively  few  possible 
positions  and  orientations  (some  of  which  may  still  be  very  similar). 

Another  consequence  of  combining  alignment  and  identification  is  the  difficulty 
of  determining  that  an  object  is  not  in  the  image.  All  possible  interpretations  must 
be  considered  and  found  inconsistent  before  an  object  can  be  rejected.  Since  the 
features  are  relatively  dense,  their  number  is  large,  and  many  inconsistent  inter¬ 
pretations  must  be  considered  and  rejected.  If  alignment  and  identification  are 
separated,  then  the  alignment  operation  alone  will  indicate  that  there  is  no  possible 
match  in  cases  of  a  substantial  mismatch.  In  other  cases,  some  comparison  will 
have  to  be  performed  as  well. 


On  Alignment  in  Human  Recognition 
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There  is  some  evidence  that  people  align  two  objects  before  comparing  them.  For 
instance,  people  appear  to  be  slower  at  judging  if  two  edge  contours  are  the  same 
when  they  are  presented  at.  different  orientations  compared  to  when  they  are  pre¬ 
sented  at  the  same  orientation  [26].  The  major  exception  to  this  is  edge  contours 
that  define  a  region  with  an  “intrinsic  axis”.  Such  contours  can  be  compared  rapidly 
even  when  presented  at  different  orientations  [29]. 


The  effect  is  illustrated  in  Figure  4.  Part  (a)  of  the  figrre  shows  a  cc  itour 
without  a  good  axis,  and  part  (b)  shows  a  contour  with  a  good  axis.  Comparison  is 
rapid  for  both  contours  when  the  orientation  is  the  same  When  the  orientation  is 
different,  however,  recognition  is  substantially  slower  for  contours  without  an  axis, 
as  in  (a),  than  for  ones  with  an  axis,  as  in  (b)  [29]. 

As  discussed  in  Section  1,  in  two  dimensions  two  points  are  sufficient  for  aligning 
a  model  with  an  image.  Similarly,  it  is  possible  to  use  a  point  and  an  orientation 
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Figure  4.  Orientation  affects  human  matching  speed,  a)  an  object  without  a 
good  axis  which  is  hard  to  match,  b)  an  object  with  a  good  axis  which  is  easy 
to  match. 

if  no  change  in  scale  is  allowed.  The  point  is  used  to  translate  the  model,  and  the 
orientation  is  used  to  rotate  it.  The  contour  in  (b)  has  a  distinguished  point  (e.g., 
the  base)  and  orientation  for  alignment,  whereas  the  one  in  (a)  does  not. 

The  results  of  these  studies  suggest  that  without  sufficient  information  to  per¬ 
form  alignment,  recognition  is  more  difficult,  perhaps  requiring  explicit  rotation  of 
an  object  to  compare  many  possible  orientations. 


4  Alignment  Points 

The  alignment  operation  requires  finding  pairs  of  corresponding  model  and  image 
points,  and  model  and  image  orientations.  Because  number  of  potential  align¬ 
ments  is  cubic  in  the  number  of  pairs,  it  is  desirable  to  limit  the  number  of  pairs  as 
much  as  possible.  There  must  be  at  least  three  pairs  that  correspond  to  the  image 
of  the  object,  however,  or  else  the  method  will  fail  to  find  an  alignment. 

To  limit  the  number  of  pairs  of  model  and  image  points  (or  orientations)  it 
is  necessary  to  associate  labels  with  these  measures,  and  only  pair  together  model 
and  image  data  with  the  same  label.  The  labels  must  be  relatively  insensitive  to 
partial  occlusion,  juxtaposition,  and  projective  distortion,  while  being  as  distinctive 
as  possible. 

There  are  several  sources  of  information  in  an  image  that  can  be  used  to  define 
and  label  points  and  orientations  for  alignment.  We  have  chosen  to  use  a  shape 
description  based  on  intensity  edges  to  define  labeled  edge  segments  that  are  used 
for  alignment.  The  method  is  described  in  Section  5,  below.  These  descriptions  can 
be  augmented  using  color,  texture,  shading  and  depth  information  to  form  more 
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distinctive  descriptions  of  the  alignment  points.  Wo  intend  tv'  exploit  -n,  !i  -.oni. 
of  information  in  the  next  version  of  the  system. 

It  is  also  possible  to  define  other  kinds  of  alignment  points  based  on  intensity 
edges.  For  instance,  vertices  where  multiple  edges  come  together  can  be  used  as 
alignment  points.  Other  kinds  of  alignment  points  are  cusps,  tips,  deep  concavities 
and  small  blobs.  In  the  current  implementation,  however,  we  have  concentrated  on 
points  and  local  orientations  derived  from  inflections  in  edge  contours. 

As  long  as  different  kinds  of  alignment  points  are  identified  by  distinct  labels, 
the  more  kinds  of  information  the  better  in  terms  of  being  able  to  quickly  identify 
the  correct  alignment  of  a  model  with  an  image.  A  large  amount  of  data  is  not  a 
problem,  only  a  large  amount  of  indistinguishable  data. 

Edge  Contour  Shape  Features 

Forming  a  shape  description  based  on  intensity  edges  requires  breaking  edge  con¬ 
tours  into  primitive  pieces.  Many  techniques  for  deriving  shape  descriptions  segment 
contours  at  maximal  curvature  points,  partly  because  of  their  supposed  psychologi¬ 
cal  importance.  The  studies  [1]  [12]  demonstrating  the  importance  of  maximal  cur¬ 
vature  points,  however,  do  not  address  the  problem  of  what  information  is  necessary 
to  recognize  an  object.  Rather,  the  studies  demonstrate  that  certain  information  is 
sufficient  to  recognize  an  object. 

For  instance,  Attneave’s  cat,  shown  in  Figure  5a,  is  constructed  by  linearly 
interpolating  between  the  maximal  curvature  points  in  a  drawing  of  a  cat.  The  fact 
that  the  drawing  is  still  easily  recognizable  has  been  used  to  claim  that  maximal 
curvature  points  are  of  special  significance.  Lowe  [21],  however,  points  out  that  the 
contour  in  Figure  5b  is  also  easily  recognizable  as  a  cat.  This  contour  is  constructed 
using  the  points  midway  between  maximal  curvature  points. 

Thus,  it  appears  that  there  is  sufficient  information  in  a  contour  that  any  one 
of  a  variety  of  sparse  descriptions  is  sufficient  for  people  to  be  able  to  recognize  an 
object.  Maximal  curvature  points,  per  se,  are  not  important. 

Because  it  is  possible  to  represent  contour  shape  using  various  sparse  repre¬ 
sentations,  the  choice  of  points  should  be  motivated  by  the  requirements  of  the 
recognition  task  being  addressed.  Maximal  curvature  points  are  highly  unstable 
under  three-dimensional  rotation  and  projection,  both  appearing  and  disappearing 
(without  being  occluded  by  other  parts  of  the  object),  making  them  inappropriate 
for  3D  from  2D  recognition.  For  example  an  ellipse  can  be  rotated  about  its  mi¬ 
nor  axis  to  obtain  a  circle  -  illustrating  maximal  curvature  points  that  disappear. 
This  circle  can  then  be  rotated  around  another  axis  to  obtain  a  different  ellipse  - 
illustrating  maximal  curvature  points  that  appear. 


mmmmm 
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Figure  5.  Attneave’s  cat,  which  is  intended  to  demonstrate  the  importance  of 
maximal  curvature  points  for  representing  shape,  and  Lowe’s  cat  which  shows 
other  points  work  as  well. 

In  contrast  to  curvature  maxima,  zero  crossings  of  curvature  (or  inflection 
points  in  the  contour)  are  relatively  stable  under  projection,  only  disappearing 
when  the  contour  is  projects  to  a  straight  line.  False  inflection  points  only  appear 
when  one  piece  of  an  object  partly  occludes  another. 

Low  curvature  regions  pose  a  problem  because  very  small  changes  in  curvature 
may  yield  “inflections”.  We  define  significant  inflections  to  occur  only  in  regions 
where  the  curvature  is  not  in  the  range  [—e,  e].  The  recognition  system  described 
in  the  next  section  uses  significant  inflection  points  and  low-curvature  regions  to 
segment  edge  contours. 


5  A  3D  from  2D  Recognition  System 


This  section  describes  a  system  that  implements  the  alignment  method  of  recogni¬ 
tion.  The  system  is  consistent  with  the  requirements  outlined  by  the  theory,  but  is 
more  specific  in  many  respects  and  thus  reflects  certain  choices  of  implementation 
that  are  only  one  of  several  possible  ways  of  doing  things. 

In  the  current  implementation,  the  features  used  by  the  system  are  obtained 
from  inflections  in  the  intensity  edges  in  a  two-dimensional  image.  These  features, 
however,  can  be  augmented  using  other  sources  of  inform;  tion,  as  mentioned  above. 

Shape  Features:  Segmenting  Edge  Contours 

The  recognition  system  forms  edge-based  shape  features  for  use  in  aligning  models 
with  images.  The  input  to  the  system  is  a  grey-level  image,  which  is  processed  by 
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an  edge  detector  [9].  The  output  of  the  edge  detector  is  noisy,  containing  edges  due 
to  shadows  and  texture  as  well  as  object  edges.  This  causes  two  problems.  First, 
the  presence  of  many  weak  edges  will  make  it  difficult  to  group  edges  together  into 
larger  shape  units.  Second,  missing  pieces  of  edge  contour  will  make  it  unclear 
whether  two  pieces  of  edge  contour  are  part  of  the  same  underlying  edge  in  the  real 
world.  The  second  problem  is  somewhat  easier  to  deal  with,  so  the  strategy  utilized 
here  is  to  discard  relatively  weak  edges. 

Given  an  array  of  edge  points,  the  points  must  be  chained  together  into  con¬ 
tours.  A  simple  method  is  to  chain  together  neighboring  points  whenever  there  is  an 
unambiguous  eight-way  neighbor.  Otherwise,  a  new  chain  is  started.  Chains  with 
low  overall  edge  strength  are  then  discarded.  Thresholding  whole  chains  rather 
than  individual  edge  points  produces  a  more  stable  output.  Finally,  edge  chains 
with  unambiguous  nearest  neighbors  are  merged  together  if  they  can  be  connected 
by  a  smooth  spline,  without  intersecting  another  edge  contour. 

Once  pieces  of  edge  contour  have  been  chained  together,  simple  shape  de¬ 
scriptors  are  derived  using  the  local  curvature  of  the  edge  contours,  f  The  edge 
contours  are  segmented  by  breaking  the  contour  at  zero  crossings  of  curvature  (in¬ 
flection  points  in  the  contour),  and  at  the  ends  of  low-curvature  regions;  producing 
straight,  positive  curvature  and  negative  curvature  segments.  As  discussed  above, 
inflections  were  chosen  as  segmentation  points  because  they  are  relatively  stable 
under  three-dimensional  rotation  and  projection. 

Multi-Scale  Descriptions 

The  purpose  of  labeling  the  edge  segments  is  to  produce  distinctive  labels  for  use 
in  pairing  together  potentially  matching  model  and  image  points.  Most  recognition 
systems  form  distinctive  labels  by  using  local  context  to  describe  a  given  feature. 
The  problem  with  this,  however,  is  that  an  image  feature  may  be  labeled  using 
context  which  is  not  part  of  the  object  being  recognized  (e.g.,  as  in  LFF  [5]  and 
SCERPO  [21]). 

We  use  a  more  limited  form  of  context,  in  which  the  edge  contour  is  smoothed 
at  various  scales,  and  the  finer  scale  descriptions  are  used  to  label  the  coarser 
scale  segments.  In  other  words,  the  coarser  scale  segments  are  used  to  group  finer 
scale  segments  together.  This  produces  distinctive  labels  without  the  problem  of 
accidentally  using  context  from  a  different  object,  because  the  “context”  is  part  of 
the  same  edge  contour. 


fCurvature  is  computed  as  the  change  in  angle  (per  unit  arclength)  between  local  tangent 
vectors  at  neighboring  pixels.  The  tangents  are  computed  using  the  least  squares  best 
fit  line  over  a  small  local  neighborhood. 
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A  hierarchy  of  curve  segmentations  can  be  obtained  by  smoothing  the  curvature 
at  different  scales,  and  using  the  smoothed  curvature  to  segment  the  edge  contour. 
Smoothing  the  curvature  preserves  only  those  inflection  points  in  the  edge  contour 
(zero  crossings  of  curvature)  that  are  significant  at  a  given  scale.  Thus  coarser  scale 
segments  correspond  to  merging  neighboring  segments  at  finer  scales,  producing  a 
scale-space  [30]  hierarchy  of  segments.  Since  coarser  scales  of  smoothing  do  not 
introduce  zero  crossings  that  were  not  present  at  finer  scales  [31],  the  hierarchy 
forms  a  tree  of  segments  from  coarser  to  finer  scales. 

Figure  6  shows  a  three-level  scale-space  curvature  segmentation  of  the  edge 
contours  of  the  widget  from  Figure  1.  Each  part  of  the  figure  shows  the  same 
contour,  segmented  according  to  the  curvature  smoothed  at  different  scales  (using 
Gaussian  filters  of  size  a  =  7,  20,  and  40  pixels,  respectively).  The  coarsest  scale 
is  at  the  top  of  the  figure  and  the  finest  scale  is  at  the  bottom.  The  endpoints  of 
each  segment  are  delimited  by  a  dot,  and  straight  regions  (at  that  scale)  are  shown 
in  bold.  Each  segment  is  labeled  with  a  letter,  and  a  number  denoting  the  level  (1 
is  coarsest  and  3  is  finest). 


Figure  6.  A  scale-space  segmentation  of  a  widget,  where  the  contours  are  seg¬ 
mented  at  inflections  in  the  smoothed  curvature.  The  coarsest  scale  is  at  the 
top.. 


Each  segment  of  edge  contour  is  classified  according  to  whether  it  is  curved  or 
straight.  The  curved  segments  are  further  classified  by  the  degree  of  closure: 
open  or  closed,  and  the  smoothness  of  the  contour:  smooth  or  unsmooth,  yielding 
a  total  of  five  types  of  segments.  Richer  descriptions  are  then  obtained  by  combining 
the  classifications  at  multiple  scales  of  smoothing. 
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This  (multi-rooted)  segmentation  tree  can  also  be  viewed  in  terms  of  the  corre¬ 
spondence  between  regions  at  neighboring  scales,  as  show  in  Figure  7.  Each  region 
at  a  coarse  scale  corresponds  to  one  or  more  regions  at  the  next  finer  scale.  Each 
segment  in  the  tree  is  indicated  by  its  label  from  Figure  6  and  by  t’  e  type  of 
segment:  straight,  curve,  and  open-curve. 
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Figure  7.  The  tree  corresponding  to  the  curvature  scale-space  segmentation  in 

Figure  6. 

Multi-scale  descriptions  are  formed  using  the  types  of  segments  at  a  given  icale, 
plus  the  structure  of  the  hierarchy  at  the  next  finer  scale.  Two  aspects  of  tb  :  tree 
structure  are  used.  First,  a  segment  is  classified  according  to  whether  it  corres  >onds 
(primarily)  to  one  or  many  segments  at  the  next  finer  scale:  single  or  multiple. 
Second,  a  multiple  segment  is  classified  according  to  whether  or  not  the  finer  scale 
segments  form  a  regular  pattern.  A  pattern  consists  of  a  repetition  of  the  same  type 
of  segment,  in  either  the  same  or  opposite  directions  of  curvature.  This  yields  the 
classification  irregular,  regular-same,  and  regular-opposite. 

For  example,  using  this  multi-scale  description,  the  straight  segment  Al  at 
level  1  in  the  tree  is  differentiated  from  the  other  straight  segments  Cl  and  El  at 
the  same  level,  because  Al  is  composed  of  multiple,  irregular  segments  at  the 
next  level  whereas  Cl  and  El  axe  each  composed  of  a  single  segment. 

Using  the  multi-scale  description,  at  the  coarsest  scale  the  widget  is  composed 
of  seven  segments,  only  two  of  which  can  be  confused  with  one  another  (the  two 
straight  segments  Cl  and  El).  The  seven  segments  are,  Al:  straight,  multi¬ 
irregular;  Bl:  curve,  single,  smooth;  Cl:  straight,  single;  Dl:  open-curve, 
single;  El:  straight,  single;  FI:  curve,  multi-irregular;  Gl:  closed-curve, 
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single,  smooth.  Thus,  the  labels  resulting  from  this  multi-level  description  are 
highly  distinctive  even  though  the  individual  features  are  coarse,  and  relatively 
sparsely  distributed  in  the  image. 

Since  the  coarse  scale  segments  are  relatively  distinct,  alignment  points  are 
chosen  using  a  coarse  scale  segmentation.  Each  segment  defines  a  point  used  for 
alignment,  based  on  its  type.  For  closed  segments  of  contour,  the  center  of  the  region 
defined  by  the  contour  is  used.  This  point  is  found  from  the  intersection  of  the  major 
and  minor  axes  of  the  region.  For  straight  segments  the  endpoints  are  used,  and 
for  other  curved  segments  the  middle  of  the  curve  is  used.  Since  the  endpoints  of 
the  segments  are  at  inflection  points  or  the  ends  of  zero  curvature  regions,  they  are 
relatively  stable,  making  it  reasonable  to  use  endpoints  and  midpoints  for  matching. 

Each  alignment  point  is  labeled  with  the  type  of  its  coarse  scale  segment. 
In  addition,  local  orientation  measures  are  defined  for  straight  and  open-curve 
segments,  for  use  in  eliminating  the  alignments  specified  by  inconsistent  triplets  (as 
described  in  Section  3). 

Using  coarse  scale  regions  for  choosing  alignment  points  reduces  the  number  of 
points,  while  retaining  relatively  distinctive  labels.  It  is  often  possible,  however,  to 
achieve  a  more  accurate  alignment  by  using  intermediate  or  finer  scale  descriptions. 
Therefore,  the  recognizer  first  finds  the  best  alignment  using  coarse-scale  features, 
and  then  attempts  to  improve  upon  it  by  performing  a  second  alignment  with  finer 
scale  features.  Since  the  model  and  the  image  are  already  almost  aligned,  this 
secondary  alignment  is  relatively  fast.  In  general,  each  partially  aligned  model 
feature  has  only  one  corresponding  image  feature,  and  the  correspondence  of  model 
and  image  features  is  correct. 

Recognition  Examples 

The  recognizer  is  implemented  on  a  Symbolics  3650,  and  takes  from  2-5  minutes  for 
each  of  the  examples  shown  in  this  section,  using  a  pre-computed  model.  First  a 
multi-scale  description  of  the  edge  segments  is  formed  and  used  to  define  alignment 
points  in  the  image,  as  described  in  the  previous  section.  Possible  alignments  are 
then  computed  using  triplets  of  model  and  image  point  pairs,  as  discussed  in  Sec¬ 
tion  3.  For  each  alignment,  the  model  is  mapped  into  the  image  and  the  transformed 
model  edges  are  correlated  with  the  image  edges.  The  al.  -  aments  are  ranked  based 
on  the  percentage  of  the  model  edge  contour  for  which  there  is  a  corresponding 
image  edge  contour.  The  recognizer  returns  the  best  alignment  accounting  for  each 
part  of  the  image  which  is  matched  by  the  model. 

We  present  several  examples  of  the  recognizer  processing  grey-scale  images  of 
widgets.  The  model  is  the  multi-scale  description  of  the  widget  shown  in  Figure  6 
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and  Figure  7.  The  inode!  is  hist  the  result  of  processing  the  image  i  t  die  isolated 
widget  in  Figure  1  in  the  -a me  manner  as  any  image.  Tims  for  Hm  ms.  models 
are  formed  directly  from  an  image  of  an  object. 

Each  example  is  illustrated  by  a  figure  with  four  parts:  a!  the  .m  y  level  image 
of  the  model  widget,  with  the  intensity  edges  superimposed,  i>)  the  guy.  level  image, 
c)  the  image  with  the  intensity  edges  superimposed,  and  d)  the  edges  of  the  aligned 
model  superimposed  on  d  ■  linage.  Part  (d)  also  indicates  the  poiids  which  were 
used  in  computing  the  aiigmu, .  of  the  model  with  the  image. 


Figure  8.  Matching  a  widget  against  an  image  of  two  widgets  in  the  plan*-,  a)  the 
grey-level  image  an*!  mi-.in*v  edges  of  the  model,  b)  the  grey-level  image,  c)  the 
image  and  the  niie.'.M*  a  ;>s,  «i)  the  edges  of  the  aligned  model  superimposed 
on  (c),  with  the  iii.-,  ■  p.  .  us  marked 


The  example  in  Figure  '•  -  -vs  i  w. »  widgets  in  the  plane.  The  top  widget  has  been 
flipped  over,  and  dm.  .  r*  *  ognized  using  only  two  <1  :nen>:omd  transfor¬ 
mations.  The  rc-n.-gp.i.*.  -.  . . distinct  positions  and  ori*-i  -  a 'ions  of  the  nodel 

that  match  90  and  1  .  !.  u:  *d«  1  edge  <-ontot:r  to  mu . •>  The- e  two 

matches  are  ■- 1 . .  *  .  .  i  i.c  image  in  par.  '■•!'  '  • 

Another  position  ami  .mentation  is  found  at  the  alignment  stage,  but.  is  elim¬ 
inated  because  the  cor  r<  iatnu  with  tin'  image  is  poor,  and  the  image  edges  are 
accounted  for  by  a  bet’e:  ,  i,  am'-nt  This  alignment  is  shown  in  Fie; are  9  superim¬ 
posed  with  the  iinag.  .  U  .  i  h  ■  ahrmnent  i-  f.und  b. .  muse  t  he  :  ...  w  might  edges 
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Figure  9.  An  alignment  of  a  widget  with  an  image  that  does  not  match  the 
model  edge  contour  with  image  edges. 


Figure  10.  Matching  a  widget  against  an  image  of  a  foreshortened  widget,  a)  the 
grey-level  image  and  intensity  ede.es  of  the  model,  hi  the  grey-level  image,  c)  the 
image  and  the  intensity  edges,  d)  the  edges  of  the  aligned  model  superimposed 
on  (c),  with  the  alignment  points  marked. 

are  indistinguishable,  and  the  three  points  used  in  cm.,-  rn'ing  the  alignment  were 
the  two  straight  edges  and  the  bend. 

Figure  10  shows  a  widget  that  has  been  tilted  approximately  30  degrees  by 
resting  one  end  of  it  on  a  block,  foreshortening  the  image.  The  recognizer  finds  a 
single  best  position  and  orientation,  which  is  shown  in  part  (d)  of  the  figure. 

The  next  example,  in  Figure  1  1.  shows  another  widget  that  has  been  tilted  out 
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Figure  11.  Matching  a  widget  against  an  image  of  a  tilted  widget,  a)  the  grey- 
level  image  and  intensi'y  edges  of  the  model,  b)  the  grey-level  image,  c)  the 
image  and  the  intensity  edges,  d)  the  edges  of  the  aligned  model  superimposed 
on  (c),  with  the  alignment  points  marked. 


of  the  image  plane  —  for  instance  the  circular  end  is  now  an  oval.  The  best  match 
is  shown  in  part  (d)  of  the  figure. 

Finally,  we  demonstrate  the  ability  of  the  recognizer  to  find  partly  occluded 
objects.  As  long  as  three  features  are  visible  in  the  image,  the  alignment  algorithm 
will  be  able  to  align  the  model  with  the  image.  Figure  12  shows  a  widget  that  has  a 
pile  of  smaller  widgets  obscuring  the  circular  end.  The  best  alignment  matches  80% 
of  the  model  edge  contour  to  image  edges,  and  is  shown  in  part  (d)  of  the  figure. 
Figure  13  shows  two  widgets  obscured  by  each  other  and  several  smaller  objects. 
The  matcher  finds  two  distinct  positions  and  orientations,  which  are  shown  in  part 
(d)  of  the  figure. 

From  these  examples  we  see  that  the  alignment  algorithm  finds  a  small  number 
of  reasonable  matches  of  widgets  to  images,  even  when  the  widget  is  foreshortened, 
scaled,  and  partly  orchid*  <t.  The  scoring  method  of  transforming  the  model  edges 
and  correlating  them  with  the  image  edges  provides  a  simple  method  for  finding 
the  best  alignment.  While  this  scoring  method  suffices  for  the  examples  considered 
here,  it  may  be  too  simple  in  the  general  case.  For  instanee,  it  inav  be  desiraole  to 
have  different  parts  of  the  model  carry  different  weights  in  sc<  jng  the  goodi  ess  of 
a  match. 
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igure  12.  Matching  a  widget  against  an  image  of  a  partly  occluded  widget, 
)  the  grey-level  image  and  intensity  edges  of  the  model,  b)  the  grey-level  im- 
ge,  c)  the  image  and  the  intensity  edges,  d)  the  edges  of  the  aligned  model 
superimposed  on  (c),  with  the  alignment  points  marked. 


The  3D  from  2D  Alignment  Method 


This  section  presents  the  alignment  method  in  detail.  It  is  shown  that  the  position, 
orientation,  and  scale  of  an  object  in  three-space  can  be  determined  from  a  two- 
dimensional  image  using  three  pairs  of  corresponding  model  and  image  points. 

The  description  of  the  alignment  method  is  di-.  ided  into  three  parts.  First  we 
discuss  the  use  of  orthographic  projection  and  a  linear  scale  factor  to  approximate 
perspective  viewing.  Then  we  present  the  alignment  method  using  explicit  three- 
dimensional  rotation.  Finally  we  present  a  version  of  the  alignment  method  that 
simulates  three-dimensional  rotation  using  planar  operations. 


Perspective  Projection 


Consideration  of  how  to  align  an  object  with  a  two-dimensional  image  raises  the 
issue  of  what  model  of  projection  to  use.  The  imaging  characteristics  of  cameras 
and  human  eyes  are  well  approximated  by  the  perspective  projection  model.  Under 
perspective  projection,  imaging  size  is  inversely  proportional  to  the  distance  from 
an  object  to  t  lie  center  of  projection,  as  illustrated  in  part  (a)  of  Figure  14.  Imaging 


Alignment 


24 


Figure  13.  Matching  a  widget  against  an  image  of  two  partly  occluded  widgets, 
a)  the  grey-level  image  and  intensity  edges  of  the  model,  b)  the  grey-level  im¬ 
age,  c)  the  image  and  the  intensity  edges,  d)  the  edges  of  the  aligned  model 
superimposed  on  (c),  with  the  alignment  points  marked. 

size  is  also  dependent  on  /,  the  focal  length  of  the  device,  which  can  be  assumed  to 
be  constant  for  a  given  sensor. 

Under  orthographic  projection,  on  the  other  hand,  imaging  size  is  independent 
of  distance.  Therefore  all  positions  along  the  viewing  axis  are  indistinguishable 
from  one  another,  as  illustrated  in  part  (b)  of  Figure  14.  Orthographic  projection  is 
simpler  to  solve  for  than  perspective  projection,  but  is  not  necessarily  an  adequate 
model  of  actual  viewing  conditions. 

There  are  two  major  practical  consequences  of  perspective  viewing.  The  first  is 
that  objects  that  are  further  away  look  smaller.  The  second  is  that  objects  that  are 
large  relative  to  the  viewing  distance  appear  distorted,  because  the  distant  parts  of 
an  object  project  smaller  images  than  the  closer  parts  do.  People  appear  to  have 
difficulty  recognizing  objects  under  conditions  of  high  projective  distortion. 

For  objects  that  are  not  large  relative  to  the  viewing  distance,  the  major  effect 
of  perspective  projection  is  scaling  proportional  to  the  distance  of  the  object.  There¬ 
fore,  in  these  cases  perspective  projection  can  be  well  approximated  by  orthographic 
projection  plus  a  linear  scale  factor. 

Orthographic  projection  plus  scale  can  be  solved  for  unambiguously  using  three 
pairs  of  model  and  image  points,  as  described  in  the  next  se>  tion.  Under  ortho- 
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Figure  14.  a)  perspective,  and  b)  orthographic  projection. 


graphic  projection,  position  along  the  viewing  axis  and  reflection  about  the  view 
plane  are  indistinguishable.  Thus  the  solution  is  unique  up  to  position  in  z  and 
reflection  about  the  z  —  0  plane. 

An  unambiguous  solution  for  the  position  and  orientation  of  an  object  under 
perspective  projection  can  require  up  to  six  pairs  of  model  and  image  points  [13]. 
Using  three  pairs  of  points  there  can  be  up  to  four  distinct  perspective  solutions. 
It  is  interesting  to  note  that  the  ambiguous  cases  involve  positions  and  orientations 
where  part  of  the  object  is  extremely  close  to  the  viewer  and  part  is  far  away  - 
cases  where  there  is  high  perspective  distortion.  Thus  three  points  are  sufficient  to 
solve  for  position  and  orientation  under  perspective  viewing  if  solutions  with  high 
perspective  distortion  are  discarded. 


The  Explicit  Three-Dimensional  Method 

Consider  three  model  points  am,  6m  and  cm  and  three  corresponding  image  points 
a.i,  b{  and  c;,  where  the  model  points  specify  three-dimensional  positions,  (r,y,z), 
and  the  image  points  specify  positions  in  the  image  plane,  (x,  y,  0).  The  alignment 
task  is  to  find  a  transformation  that  maps  the  plane  d(  fined  by  the  three  model 
points  onto  the  image  plane,  such  that  each  model  point  coincides  with  its  corre¬ 
sponding  image  point.  If  no  such  transformation  exists,  then  the  alignment  process 
must  determine  this  fact. 

Since  the  viewing  direction  is  along  the  z-axis,  an  alignment  is  a  transformation 
that  positions  the  model  such  that  am  projects  along  the  z-axis  onto  a  j,  and  similarly 
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for  bm  onto  6,,  and  cm  onto  c,.  The  transformation  consists  of  translations  in  x  and 
y,  and  rotations  about  three  orthogonal  axes.  There  is  no  translation  in  z  because  all 
points  along  the  viewing  axis  are  equivalent  under  orthographic  projection.  Instead, 
distance  from  the  viewer  is  reflected  by  a  change  in  scale.  First  we  show  how  to  solve 
for  the  alignment  assuming  no  change  in  scale,  and  then  modify  the  computation 
to  allow  for  a  linear  scale  factor. 

The  first  step  in  finding  an  alignment  is  to  translate  the  model  points  so  that 
one  point  projects  along  the  z-axis  onto  its  corresponding  image  point.  Using  the 
point  am  for  this  purpose,  the  model  points  are  translated  by  (x0i  —  xam ,  yai  ~Vam,  0), 
yielding  the  model  points  a'ml  b'm  and  c'm.  This  brings  a'mi,  the  projection  of  a'm 
into  the  image  plane,  into  correspondence  with  a,  as  illustrated  in  Figure  15a. 


Now  it  is  necessary  to  rotate  the  model  about  three  orthogonal  axes  to  align 
bm  and  cm  with  their  corresponding  image  points.  First  we  align  one  of  the  model 
edges  with  its  corresponding  image  edge  by  rotating  the  model  about  the  z-axis. 
Using  the  a'mb'm  edge  we  rotate  the  model  by  the  angle  between  the  image  edge 
atbi,  and  the  projected  model  edge  a'mib'mi,  yielding  the  new  model  points  b^  and 
c'^,  as  illustrated  in  Figure  15b. 


To  simplify  the  presentation,  the  coordinate  axes  are  now  shifted  so  that  a i  is 
the  origin,  and  the  x-axis  runs  along  the  ah  edge. 

Because  b'^,  the  projection  of  into  the  image  plane,  lies  along  the  x  axis, 
it  can  be  brought  into  correspondence  with  by  simply  rotating  the  model  about 
the  y- axis.  The  amount  of  rotation  is  determined  by  the  relative  lengths  of  ambm 
and  a,bi ,  because  the  model  must  be  rotated  such  that  the  projected  model  edge 
is  the  same  length  as  the  image  edge.  If  the  model  edge  is  shorter  than  the  image 
edge,  then  there  is  no  such  rotation,  and  hence  the  model  cannot  be  aligned  with 
the  image. 


Lift 

bm 


Thus,  the  model  points  b and  c'4,  are  rotated  about  the  y-axis  by  <f>  to  obtain 
and  o'",  where 


cos  4>  = 


\\h-  (1,0,0)|| 


Pm  -(1,0,0)11 

for  0  <  cos  <j>  <  1.  The  result  of  this  rotation  is  illustrated  in  Figure  15c. 


Finally,  c'^  is  brought  into  correspondence  with  a  by  rotation  about  the  x- 
axis.  The  degree  of  rotation  is  again  determined  by  the  relative  lengths  of  model 
and  image  edges.  In  the  previous  case,  however,  the  edges  were  parallel  to  the 
x-axis,  and  therefore  the  length  was  the  same  as  the  x  component  of  the  length.  In 
this  case,  the  edges  need  not  be  parallel  to  the  y  axis,  and  therefore  the  y  component 
of  the  lengths  must  be  used.  Thus,  the  rotation  about  the  x-axis  is  9,  where 


cos  9  = 


P.-- ((U,0)H 

Pm  -(0,1,0)11 


(2) 
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for  0  <  cos  9  <  1 . 

If  the  model  distance  is  shorter  than  the  image  distance,  there  is  no  trans¬ 
formation  that  aligns  the  model  and  the  image.  Furthermore,  if  the  rotation  does 
not  actually  bring  c'"'  into  correspondence  with  c,,  then  there  is  also  no  alignment. 
This  latter  case  can  result  because  the  rotations  are  those  that  will  bring  the  points 
into  alignment  if  there  actually  is  a  consistent  solution.  H  there  is  no  solution  then 
it  may  still  be  possible  to  solve  for  the  rotations,  but  they  will  not  bring  all  three 
points  into  alignment. 

The  final  rotation  brings  the  plane  defined  by  the  three  model  points  into  cor¬ 
respondence  with  the  image  plane,  as  illustrated  in  Figure  15d.  This  combination 
of  translations  and  rotations  can  now  be  used  to  map  the  model  into  the  image, 
in  order  to  determine  if  the  object  is  in  fact  in  the  image  at  this  position  and 
orientation. 

Now  it  is  necessary  to  solve  for  a  linear  scale  factor  as  a  sixth  unknown,  in  order 
to  simulate  distance  from  the  viewer  by  a  scale  factor.  The  final  two  rotations  which 
align  bm  with  6*,  and  cm  with  c,  are  the  only  computations  affected  by  a  charge  in 
scale.  The  alignment  of  bm  involves  movement  of  bmi  along  the  x-axis,  whereas  the 
alignment  of  cm  involves  movement  of  cm,  in  both  the  x  and  y  directions. 

Because  the  movement  of  bmi  is  a  sliding  along  the  x-axis,  only  the  x-component, 
x*,  changes.  The  change  is  given  by  the  rotation  <f>  about  the  y-axis,  as  in  (1).  With 
a  scale  factor,  s,  this  becomes 

i'b  =  sxa(cos^).  (3) 

Similarly  the  movement  of  cmi  in  the  y  direction  is  given  by  the  rotation  9  about 
the  x-axis,  as  in  (2).  With  a  scale  factor  this  becomes 

y'c  =  syc(cos0).  (4) 

The  movement  of  cm,  in  the  x  direction  is  given  by  the  rotations  about  both 
the  x-  and  the  y-axis.  From  the  matrix  for  a  combined  rotation  about  the  x-  and 
y-  axis  we  obtain 

x'  =  (x  cos  9  4-  y  sin  4>  sin  9). 

Thus  with  the  scale  factor,  the  x  component  of  cm  is 

xc  —  s(xc  cos  9  4-  yc  sin<^>  sin#).  (5) 

Now  we  have  three  equations  in  the  three  unknowns,  s,  9 ,  and  <t>.  One  method 
to  solve  for  s  is  to  substitute  for  cos#,  sin#,  and  sintf>  in  (5).  From  (3)  we  know 
that, 

sin <t>  =  —Js2x2b  -  x'k2.  (6) 

SXb  » 

And  similarly  from  (4), 

sin#  =  —\Zs2yl  -  y? .  (7) 

syc 


^  A 
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Substituting  (6)  and  (7)  into  (5)  and  simplifying  yields 

52(x6x'  -  xcx'b)2  =  (s2x2b  -  x'b2)(s2y2  -  y'2). 

Expanding  out  the  terms  we  obtain 

s4(x2by2c)  -  s2(x2y'2  4-  x,2y2  +  (xbx'c  -  xcx'b)2)  +  x'by'2 , 

a  quadratic  in  s2 .  While  there  are  generally  two  possible  solutions,  it  can  be  shown 
that  only  one  of  the  solutions  will  specify  possible  values  of  cos  <f>  and  cos#  [28]. 

Having  solved  for  the  scale  of  an  object,  the  final  two  rotations  <fi  and  8  can  be 
computed  using  (1)  and  (2)  modified  to  account  for  the  scale  factor, 

||6.  -  (1,0,0)11 


and 


s(cos<f>)  = 


s(cos#)  = 


II*- 


(1,0,0)|| 
c<-  (0,1,0)|| 


,km- (0,1,0)11 

Thus  solving  for  scale  involves  the  computation  of  s,  and  slight  a  modification  to 
the  computation  of  the  final  two  rotations. 


The  Planar  Method 

For  planar  models,  the  three-dimensional  alignment  task  only  involves  mapping 
points  from  one  plane  to  another.  Therefore,  a  planar  model  can  be  aligned  with  an 
image  using  only  planar  operations.  In  effect,  the  actual  three-dimensional  rotation 
and  translation  are  simulated  in  the  plane. 

Initially  the  model  points  are  translated  and  rotated  such  that  a,  and  am  are 
coincident,  and  the  a,bi  and  ambm  edges  are  aligned.  Then  a  coordinate  shift  i- 
performed  to  place  cq  and  am  at  (0,0),  and  b,  and  b,n  on  the  x-axis,  as  shown 
in  Figure  16a.  This  is  the  two-dimensional  analog  i  t  the  translation  and  r-axis 
rotation  done  in  the  three-dimensional  alignment  algorithm  of  the  previous  section. 

Now,  rather  than  performing  three-dimensional  rotations  to  align  bm  with 
6,,  and  cm  with  c,,  the  model  is  simply  scaled  and  then  sheared.  In  the  three- 
dimensional  case,  there  is  a  rotation  about  the  y-axis  that  moves  bm  in  the  x- 
direction,  a  rotation  about  the  x-axis  that  moves  cm  in  the  y-direction,  and  a 
combined  rotation  about  the  two  axes  that  moves  cm  in  the  x-direction.  Thus  the 
planar  operations  must  simulate  these  three  motions. 

To  make  bm  coincident  with  6,,  simulating  the  rotation  about  the  y-axis.  it  is 
only  necessary  to  scale  the  x-coordinates  of  the  model  points  by 

II*.  II 
II M 

obtaining  b'm  and  c'm  as  illustrated  in  Figure  1Gb. 
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Figure  16.  The  planar  alignment  process:  a)  the  points  a,  and  am  are  brought 
into  correspondence  and  the  ab  edges  are  aligned,  b)  the  points  6,  and  bm  are 
brought  into  correspondence,  c)  the  point  cm  is  brought  onto  the  line  through 
c,  and  parallel  to  the  r-axis,  and  d)  the  points  c,  and  cm  are  brought  into 
correspondence. 
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Making  cm  coincident  with  c,  involves  two  steps.  First,  simulating  the  rotation 
about  the  x-axis,  the  y-coordinates  of  the  model  points  are  scaled  by 

Ik,  -(0,1)11 

l|Cm-  (0,1)|| 

This  puts  c'^  on  the  line  through  the  point  c,,  and  parallel  to  the  x-axis,  as  illus¬ 
trated  in  Figure  16c. 

Now  c"  moves  in  the  x -direction  such  that  it  is  coincident  with  c,.  For  the 
point  cm,  the  distance  d,  is 


In  general,  a  point  will  move  in  the  x-direction  proportional  to  its  y-position.  be¬ 
cause  points  further  from  the  x-axis  will  move  more  due  to  the  combined  effect  of 
the  two  three-dimensional  rotations.  Thus  the  general  form  for  computing  the  new 
x-coordinate  of  a  point  is 

x'  =  x  +  d  — 

yc, 

for  d  given  in  (8).  This  final  shearing  brings  all  three  points  into  correspondence, 
as  shown  in  Figure  16d. 

Unlike  the  three-dimensional  alignment  algorithm,  the  planar  alignment  algo¬ 
rithm  finds  an  alignment  for  any  triplet  of  model  and  image  points,  because  the  x 
and  y  scaling  and  shearing  can  bring  any  two  pairs  of  points  into  alignment.  Thus 
an  additional  point  must  be  used  to  ensure  that  the  alignment  is  consistent.  This 
also  provides  an  added  flexibility  in  matching,  however,  because  operations  such 
as  stretching  of  an  object  will  not  cause  the  matching  to  fail.  Then  the  fact  that 
there  are  separate  scale  factors  for  the  x  and  y  directions  will  allow  other  points  ir 
a  planar  model  to  be  aligned  with  a  stretched  image  of  an  object. 


7  Aligning  Non-Flat  Objects 


We  have  seen  how  to  align  and  match  flat  objects  which  have  three-dimensional 
positional  uncertainty.  In  this  section,  we  briefly  consider  several  means  of  extending 
the  alignment  method  to  the  domain  of  non-flat  objects. 

At  one  extreme,  an  object  can  be  represented  by  a  single  three-dimensional 
model.  Using  such  a  representation,  the  alignment  computation  is  the  same  as  for  a 
flat  object.  A  triplet  of  corresponding  model  and  image  points  are  used  to  determine 

the  position  and  orientation  of  the  model.  In  order  to  project  the  aligned  model 
info  the  image,  however,  it  is  necessary  to  determine  which  portions  of  the  object 
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are  visible  given  its  position  and  orientation.  This  is  accomplished  using  standard 
hidden  line  and  surface  elimination  algorithms  (cf.  [15]). 

In  order  for  an  alignment  of  a  three-dimensional  model  with  an  image  to  be 
valid,  it  must  be  the  case  that  the  three  points  used  in  alignment  are  all  visible 
given  the  position  and  orientation  of  the  object  that  they  specify.  Thus  for  each 
possible  alignment,  the  three  model  points  must  be  checked  to  make  sure  that  they 
are  not  on  hidden  edges  or  surfaces. 

A  single  three-dimensional  model  is  not  well  suited  to  the  problem  of  visual 
recognition,  because  it  does  not  explicitly  represent  information  about  viewpoint. 
Thus,  before  a  three-dimensional  model  can  be  matched  to  an  image,  hidden  lines 
and  surfaces  be  removed. 

Furthermore,  it  is  difficult  to  construct  a  single  three-dimensional  model  from 
image  data,  because  multiple  views  must  be  integrated  together.  While  there  are 
systems  for  building  models  from  multiple  views  [18],  the  problem  is  quite  difficult, 
and  often  requires  a  large  number  of  images.  Therefore,  most  vision  systems  require 
object  models  to  be  entered  explicitly,  rather  than  forming  them  automatically 
from  images.  Moreover,  forming  a  single  three-dimensional  model  discards  the 
viewpoint  information  which  is  necessary  for  recognition.  Thus,  viewpoint  must  be 
reconstructed  by  eliminating  hidden  surfaces. 

It  also  appears  to  be  the  case  that  people  do  not  use  a  single  three-dimensional 
model  of  an  object  for  recognition.  Matching  such  a  model  against  an  image  requires 
rotating  the  model  in  three-space  to  bring  it  into  correspondence  with  the  image 
(or  vice  versa).  There  is  evidence,  however,  that  people  are  quite  slow  at  three- 
dimensional  mental  rotation  [27].  Recognition,  on  the  other  hand,  is  extremely 
rapid.  This  suggests  that  people  may  not  manipulate  three-dimensional  models  in 
recognition. 

Planar  Models 

At  the  opposite  extreme,  it  is  possible  to  model  an  object  in  terms  of  planar  views. 
In  other  words,  to  use  the  2D  projection  of  an  object  from  a  given  viewing  position 
as  a  Hat  model.  For  this  kind  of  representation,  it  is  necessary  to  have  multiple 
rmxlels  of  a  single  object,  corresponding  to  different  possible  views. 

A  planar  view  of  an  object  will  only  match  images  that  are  taken  at  exactly 
the  same  viewing  angle  as  the  model  view.  At  any  other  angle,  the  object  is  only 
approximated.  Therefore  the  use  of  planar  models  comes  at  the  cost  of  some  inac¬ 
curacy  in  the  matching  process.  The  use  of  planar  models,  however,  simplifies  the 
recognition  process.  First,  a  flat  model  can  be  aligned  with  an  image  using  only  the 
planar  operations  of  translation,  rotation,  scaling  and  shearing  simulating  three- 
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dimensional  motion  in  the  plane.  Second,  it  is  not  necessary  to  eliminate  hidden 
surfaces  in  order  to  match  the  model  with  the  image. 


Figure  17.  a)  a  planar  model  of  a  cube,  b)  two  different  views  of  the  cube,  c) 
planar  alignments  of  the  model  with  the  views. 

A  planar  view  of  a  non-planar  object  specifies  only  two-dimensional  information 
about  points  that  actually  have  three-dimensional  position.  Therefore,  an  alignment 
of  such  a  model  with  an  image  will  only  correctly  transform  points  that  are  coplanar 
(in  three-dimensions)  with  the  three  model  points  used  for  alignment.  Other  model 
points  will  be  treated  as  if  they  are  the  same  plane  even  though  they  are  not.  The 
amount  of  error  this  introduces  is  proportional  to  both  the  distance  from  the  poinf 
to  the  match  plane,  and  the  difference  in  angle  between  the  model  viewpoint  and  the 
image  viewpoint.  For  an  object  like  a  cube,  which  is  p  ly  approximated  by  a  single 
plane,  the  error  rapidly  becomes  substantial  as  the  object  is  rotated.  Figure  17a 
shows  an  isometric  view  of  a  cube  which  will  serve  as  a  planar  model.  Part  (b)  of 
the  figure  shows  two  views  of  the  cube,  rotated  by  j  and  respectively.  In  part 
(c).  the  model  has  been  aligned  with  each  of  the  views,  using  the  three  alignment 
points  marked  by  dots.  Since  the  model  is  planar,  the  alignment  is  only  correct  for 
those  edges  which  are  coplanar  (on  the  actual  object.)  with  the  alignment  points. 

Multiple  Planar  Alignments 

An  intermediate  possible  representation  is  to  use  multiple  views,  rather  than  a 
single  object-centered  model,  but  not  have  the  views  be  simple  two-dimensional 
projections  of  the  object.  For  instance,  it  is  possible  to  use  a  “2.5D  representation 
which  encodes  a  given  view  of  an  object  in  terms  of  almost-planar  pieces.  Thus 
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each  planar  section  of  the  model  could  be  aligned  with  the  image  separately.  Using 
such  a  representation,  the  cube  model  in  Figure  17a  would  have  each  face  marked 
as  being  a  different  planar  piece. 

This  representation  requires  a  model  for  each  viewpoint  from  which  a  different 
set  of  planar  pieces  are  visible.  A  multiple-view  model  of  a  cube,  for  example, 
would  consist  of  three  views:  a  view  when  one  surface  is  visible,  a  view  when  two 
surfaces  are  visible,  and  a  view  when  three  surfaces  are  visible.  A  cube  with  each 
surface  distinctly  marked  would  have  twenty-six  different  views:  eight  views  with 
three  visible  surfaces  (one  for  each  vertex),  twelve  views  with  two  visible  surfaces 
(one  for  each  edge),  and  one  view  of  each  of  the  six  surfaces  alone. 


Figure  18.  Two  non-cubes  that  match  a  cube  using  the  locally-rigid  alignment 
algorithm. 

Using  this  representation,  and  aligning  each  model  plane  separately,  the  cube  model 
can  be  matched  perfectly  against  the  views  in  parts  (b)  and  (c)  of  Figure  17.  The 
problem  with  this  matching  scheme  is  that  it  is  too  general,  accepting  the  non-cubes 
in  Figure  18  as  cubes. 

This  over-generality  can  be  restricted  by  noting  that  the  alignments  for  the 
various  planes  must  be  related.  Even  without  knowing  the  true  three-dimensional 
angles  between  faces,  if  the  object  is  convex  then  it  cannot  be  rotated  such  that  both 
external  edges  become  lower  than  the  internal  edge  without  exposing  a  new  surface 
on  the  bottom.  The  convexity  of  the  external  contour  must  be  maintained.  This 
eliminates  the  non-cube  in  Figure  18a.  We  are  currently  investigating  the  use  of 
such  relations  to  limit  the  flexibility  of  the  multiple-alignment  method  of  matching. 

Human  recognition  of  three-dimensional  objects  from  two-dimensional  images 
is  also  characterized  by  overly  general  matching.  For  instance,  the  partial  pyramid 
in  Figure  19  cannot  actually  be  a  projection  of  a  solid  object,  even  though  it  appears 
to  be  one  [24].  In  order  to  correspond  to  a  real  object  the  three  numbered  edges 
must  come  together  at  a  point,  but  they  do  not.  This  over-g<  nerality  is  probably 
not  due  to  acuity  limitations,  as  evidenced  by  another  set  of  experiments. 
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Figure  19.  While  appearing  to  be  a  partial  pyramid,  this  figure  cannot  corre¬ 
spond  to  an  actual  solid  object  (from  [24]). 

Even  for  common  objects,  people  are  quite  lax  about  what  contours  they  accept 
as  being  real  projections  of  an  object.  For  instance,  in  deciding  whether  cube-like 
figures  axe  actually  projections  of  cubes  people  are  only  about  85%  correct  [24].  The 
errors  are  apparently  due  to  the  way  people  match  models  to  images,  rather  than 
limitations  in  visual  acuity.  This  is  demonstrated  by  the  fact  that  performance  can 
be  greatly  improved  using  special  tricks.  In  order  for  a  figure  to  be  a  projection  of 
a  cube,  the  angles  about  the  central  vertex  must  all  be  at  least  90  degrees.  Using 
this  rule,  people  can  score  nearly  100%  correct  on  the  cube  classification  task  [24], 


Local  Alignment 

In  order  to  perform  separate  alignments  for  each  object  plane,  it  is  necessary  to 
know  what  parts  of  the  object  are  nearly  coplanar.  One  possibility  is  to  use  “2.5B' 
models  that  indicate  which  parts  of  the  object  are  on  the  same  local  plane.  Thus 
the  models  still  specify  only  two-dimensional  posith  ual  information,  but  they  group 
edges  together  according  to  what  actual  object  plane  they  lie  on. 

Another  possibility  is  to  determine  where  the  different  object  planes  are  at 
recognition  time.  One  method  of  doing  this  is  to  use  the  pairs  of  model  and  image 
features  to  define  regions  for  local  alignment.  For  instance,  the  set  of  model  points 
can  be  triangulated,  and  then  each  triangular  region  can  be  aligned  with  the  image 
separately.  To  map  the  model  to  the  image,  all  the  model  points  inside  a  given 
triangle  are  transformed  using  the  alignment  for  that  ti: angle.  To  the  extent  that 
a  triangle  falls  on  a  nearly  planar  part  of  an  object,  this  will  produce  a  correct 
alignment  for  that  region.  By  aligning  each  locally  neighboring  triple  of  model 
features  found  in  the  triangulation,  rigidity  is  preserved  for  each  triangle,  but  not 
for  the  object  as  a  whole. 

Points  in  a  given  triangle  will  not  be  mapped  into  the  image  correctly  if  they  are 
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not  on  the  same  plane  of  the  object  as  the  three  points  defining  the  triangle.  Such 
points,  however,  will  usually  be  transformed  to  a  point  near  their  correct  position 
because  the  alignment  will  be  partially  correct.  Therefore  the  locally-rigid  alignment 
process  can  be  iterated,  starting  with  a  three-point  alignment,  and  using  the  partial 
match  to  pick  potential  corresponding  model  and  image  points  for  computing  more 
refined,  and  less  globally  rigid,  alignments. 

This  algorithm  starts  by  computing  a  standard  three-point  rigid  alignment.  If 
the  initial  alignment  does  not  match  any  portion  of  the  model  to  the  image,  then  no 
further  action  is  taken.  If  there  is  a  match,  however,  then  the  alignment  is  used  to 
pick  additional  corresponding  pairs  of  model  and  image  points.  The  model  points 
are  triangulated  and  each  triangle  is  aligned  separately.  This  process  is  repeated 
until  either  a  good  match  of  the  model  to  the  image  is  obtained,  or  the  triangles 
become  small. 

To  show  how  the  local  alignment  algorithm  works,  we  consider  aligning  two 
slightly  different  views  of  a  station  wagon.  First,  we  show  that  a  single  rigid  align¬ 
ment  is  not  sufficient  to  align  the  two  views.  Figure  20  shows  a  single  three- 
dimensional  alignment,  using  three  points  chosen  on  the  rear  of  the  station  wagon. 
Part  (a)  shows  the  ‘‘model”  image  that  will  be  transformed,  and  part  (b)  shows 
the  image.  The  three  corresponding  points  in  the  two  images  are  marked.  Part  (c) 
shows  the  synthetic  image  formed  by  aligning  the  “model”  points  with  the  image 
points  -  translating,  rotating  and  scaling  the  “model”  according  to  the  alignment 
points.  Part  (d)  shows  the  superposition  of  the  image  with  the  synthetic  image.  It 
can  be  seen  from  the  superposition  that  the  side  of  the  wagon  has  been  substantially 
foreshortened  in  the  two  images.  Therefore,  similarly  to  the  cube  in  the  previous 
section,  the  alignment  brings  the  rear  plane  of  the  wagon  into  good  agreement,  but 
does  not  do  so  for  the  side. 

Figure  21  shows  the  same  two  views  with  a  set  of  corresponding  model  and 
image  points  marked.  The  points  are  triangulated,  and  separate  local  alignments  are 
performed  for  each  triangle.  The  synthetic  image  in  part  (c)  is  formed  by  mapping 
each  pixel  from  the  “model”  image  according  to  the  transformation  specified  by 
the  triangle  that  the  pixel  falls  in.  Pixels  that  fall  in  no  triangle  are  transformed 
using  triangles  formed  between  the  four  corners  of  the  image  and  the  convex  hull 
of  the  point  set.  The  superposition  of  the  image  with  the  synthetic  image  in  part 
(d)  shows  that  the  alignment  is  very  good. 

Thus  by  using  a  relatively  small  number  of  corresponding  points,  two  different 
planar  views  of  an  object  can  be  matched  to  one  another  with  a  high  degree  of 
accuracy.  This  multiple  alignment  method  allows  three-dimensional  objects  with 
three-dimensional  positional  freedom  to  be  recognized  from  a  single  two-dimensional 
view  using  planar  models. 
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8  Summary 

Recognition  is  generally  viewed  as  a  search  through  the  space  of  possible  positions 
and  orientations  of  objects.  The  idea  of  the  alignment  approach  is  to  separate  this 
search  into  two  stages.  In  the  first  stage,  the  position,  orientation,  and  scale  of 
an  object  are  found  using  a  minimal  amount  of  information,  such  as  three  pairs  of 
model  and  image  points.  In  the  second  stage,  the  alignment  is  used  to  map  the 
object  model  into  image  coordinates  for  comparison  with  the  image. 

There  are  two  major  advantages  of  this  approach.  First,  by  using  a  small  fixed 
number  of  model  and  image  features,  we  avoid  structuring  recognition  as  search 
through  an  exponential  space.  Second,  by  storing  object  models  such  that  they 
are  aligned  with  one  another,  a  single  alignment  computation  can  be  used  to  map 
multiple  objects  into  the  image. 

The  key  observation  behind  the  approach  is  that  the  alignment  can  be  per¬ 
formed  with  a  small  amount  of  information.  For  example,  three  points  are  sufficient 
to  determine  the  position,  orientation  and  scale  of  a  rigid  object  in  three-space  from 
a  single  two-dimensional  image.  Similarly,  two  points  and  an  orientation  measure 
can  also  be  used  to  solve  for  this  alignment. 

We  have  implemented  a  recognizer  using  the  alignment  method.  This  system 
chooses  features  for  alignment  using  a  scale-space  segmentation  of  edge  contours. 
The  multiple  scale  description  is  used  for  choosing  reliable  alignment  points,  and  for 
associating  descriptive  labels  with  them.  Coarse  scale  segments  are  described  both 
in  terms  of  their  shape,  and  the  structure  of  the  scale-space  hierarchy  at  the  next 
finer  level.  This  produces  relatively  distinctive  features  for  use  in  finding  possible 
alignments  of  a  model  and  an  image. 

To  demonstrate  the  recognition  method,  severa1  examples  were  shown  of  the 
recognizer  finding  a  widget  with  arbitrary  three-dimensional  position  and  orienta¬ 
tion,  in  a  two-dimensional  image.  From  these  examples  it  can  be  seen  that  the 
alignment  algorithm  finds  a  small  number  of  reasonable  matches  of  widgets  to  im¬ 
ages,  even  when  the  widget  is  foreshortened,  scaled,  and  partly  occluded. 

Finally,  we  have  briefly  discussed  how  the  alignment  method  can  be  extended 
to  recognize  rigid  objects  in  general,  either  by  using  a  three-dimensional  model,  or 
by  using  two-dimensional  models  and  multiple  local  alignments. 
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